A Method for Eliminating Class Noise in Text Classification Based on Feature Class Attribute

Authors

  • WANG Qiang
  • GUAN Yi
  • WANG Xiao-Long
Abstract

This paper presents a novel algorithm for eliminating class noise in text classification, based on an analysis of the class attributes of features. The algorithm reduces class noise for the classifier by mining the most representative class information carried by text features: it uses the class attributes linked to features to prejudge candidate class labels for unseen documents, then classifies each document only within its candidate class space. This reduces the number of decisions, cuts time cost, and improves accuracy. Experimental results on Chinese and English corpora show that the algorithm performs well: the F-measure is 0.76 and 0.93, respectively, and the classifier's run-time efficiency improves greatly. A further experiment indicates that the algorithm extends well: with a feedback learning strategy, the F-measure rises further to 0.806 and 0.943.
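The abstract does not spell out how class attributes are scored or how the final classifier works, so the following is only an illustrative sketch of the two-stage idea it describes: prejudge a small candidate class set from feature-to-class links, then classify within that set. All names here (`feature_class_attributes`, `candidate_classes`, `classify`, `overlap_score`), the toy data, and the count-based scoring are assumptions made for illustration, not the authors' method.

```python
from collections import Counter, defaultdict

def feature_class_attributes(docs, labels, top_k=1):
    """Link each feature (token) to the top_k classes it appears in most often.
    Assumption: document frequency per class stands in for the paper's measure."""
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for tok in set(tokens):
            counts[tok][label] += 1
    return {tok: [c for c, _ in ctr.most_common(top_k)]
            for tok, ctr in counts.items()}

def candidate_classes(tokens, attrs, all_classes, max_candidates=2):
    """Prejudge candidate labels: each known feature votes for its linked classes."""
    votes = Counter(c for tok in tokens for c in attrs.get(tok, []))
    if not votes:
        # No known feature: fall back to the full class space.
        return list(all_classes)
    return [c for c, _ in votes.most_common(max_candidates)]

def classify(tokens, candidates, score):
    """Run the (hypothetical) scoring function only over the candidate classes."""
    return max(candidates, key=lambda c: score(tokens, c))

# Toy training data (made up for this sketch).
train_docs = [
    ["goal", "match", "team"], ["team", "coach", "goal"],
    ["stock", "market", "bank"], ["bank", "loan", "market"],
    ["film", "actor", "scene"], ["actor", "film", "award"],
]
train_labels = ["sports", "sports", "finance", "finance", "film", "film"]
classes = ["sports", "finance", "film"]

attrs = feature_class_attributes(train_docs, train_labels)

# A deliberately simple stand-in classifier: sum of per-class token counts.
class_vocab = defaultdict(Counter)
for tokens, label in zip(train_docs, train_labels):
    class_vocab[label].update(tokens)

def overlap_score(tokens, c):
    return sum(class_vocab[c][t] for t in tokens)

doc = ["team", "goal", "bank"]
cands = candidate_classes(doc, attrs, classes)   # "film" is never scored
pred = classify(doc, cands, overlap_score)
print(cands, pred)
```

In this sketch the saving comes from `classify` evaluating only the candidate classes rather than every class, which is one plausible reading of "reduce the number of decisions" in the abstract.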

Similar resources

A Novel One Sided Feature Selection Method for Imbalanced Text Classification

Imbalanced data appears in many areas, such as text classification, credit card fraud detection, risk management, web page classification, image classification, medical diagnosis/monitoring, and biological data analysis. Classification algorithms tend to favor the large class and may even treat minority-class data as outliers. The text data is one of t...

Connected Component Based Word Spotting on Persian Handwritten image documents

Word spotting makes unindexed image documents searchable by locating a word or words in a document image, given a query word. This problem is challenging, mainly due to the large number of word classes with very small inter-class and substantial intra-class distances. In this paper, a segmentation-based word spotting method is presented for multi-writer Persian handwritten doc-...

A New Framework for Distributed Multivariate Feature Selection

Feature selection is an important issue in classification. Selecting good features by maximizing relevance to the class label and minimizing redundancy among features improves classification accuracy. However, most current feature selection algorithms work only in a centralized setting. In this paper, we suggest a distributed version of the mRMR featu...

Improved Nonparametric Discriminant Analysis for Hyperspectral Image Classification with Limited Training Samples

Feature extraction plays an important role in improving hyperspectral image classification. Compared with parametric methods, nonparametric feature extraction methods perform better when classes are not normally distributed. Besides, these methods can extract more features than parametric feature extraction methods can. Nonparametric feature extraction methods use nonparametric s...

An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification

Due to the exponential growth of electronic texts, organizing and managing them requires tools that deliver the information users search for in the shortest possible time. Thus, classification methods have become very important in recent years. In natural language processing, and especially text processing, one of the most basic tasks is automatic text classification. Moreover, text ...

Publication date: 2007